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Abstract 

Background: Calpains are Ca 2+ -dependent cysteine proteases that participate in a range of crucial cellular 
processes. Dysfunction of these enzymes may cause, for instance, life-threatening diseases in humans, the loss of 
sex determination in nematodes and embryo lethality in plants. Although the calpain family is well characterized in 
animal and plant model organisms, there is a great lack of knowledge about these genes in unicellular eukaryote 
species (i.e. protists). Here, we study the distribution and evolution of calpain genes in a wide range of eukaryote 
genomes from major branches in the tree of life. 

Results: Our investigations reveal 24 types of protein domains that are combined with the calpain-specific catalytic 
domain CysPc. In total we identify 41 different calpain domain architectures, 28 of these domain combinations have 
not been previously described. Based on our phylogenetic inferences, we propose that at least four calpain variants 
were established in the early evolution of eukaryotes, most likely before the radiation of all the major supergroups 
of eukaryotes. Many domains associated with eukaryotic calpain genes can be found among eubacteria or 
archaebacteria but never in combination with the CysPc domain. 

Conclusions: The analyses presented here show that ancient modules present in prokaryotes, and a few de novo 
eukaryote domains, have been assembled into many novel domain combinations along the evolutionary history of 
eukaryotes. Some of the new calpain genes show a narrow distribution in a few branches in the tree of life, likely 
representing lineage-specific innovations. Hence, the functionally important classical calpain genes found among 
humans and vertebrates make up only a tiny fraction of the calpain family. In fact, a massive expansion of the 
calpain family occurred by domain shuffling among unicellular eukaryotes and contributed to a wealth of 
functionally different genes. 
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Background 

Calpains are Ca 2+ -dependent cysteine proteases that 
regulate a multitude of intracellular processes by limited 
proteolysis of various substrates [1-3]. Defects in calpain 
function are associated with embryonic lethality in mice, 
muscular dystrophies in humans, gastropathy, tumori- 
genesis and neurogenesis disorders [4-7]. Furthermore, 
calpains function during embryonic development in 
Drosophila and mediate an environmental adapation to 
pH-dependent changes in fungi [8,9]. Plants contain only 
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one calpain gene, Dekl, present in species ranging from 
the moss Physcomitrella patens to angiosperms, where it 
is essential for embryo development, and is proposed to 
have played a critical role in the evolution of differen- 
tiated multicellular plants [10-12]. 

The calpain family belongs to protease clan CA, and is 
classified into classical and non-classical forms based on 
domain architecture [3]. The classical calpains consist of 
four conserved domains: an N-terminal anchor helix 
(Nter), a catalytic protease core domain (CysPc) with the 
two subdomains PCI and PC2, a C2-like domain (C2L) 
and a penta-EF-hand domain (PEF), designated here as 
Nter-CysPc-C2L-PEF. Non-classical calpains lack both 
the Nter and the PEF domain and may contain add- 
itional domains in combination with CysPc [2,3]. The 
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majority of calpain research has focused on the three 
ubiquitously expressed classical calpains CAPN1 and 
CAPN2 in mammals and CAPN11 in birds. Both 
CAPN1 and 2 are closely related 80-kDa proteins and 
are highly conserved at the sequence and structural 
levels [13]. They can form separate heterodimers by 
binding to a common 30-kDa small regulatory subunit 
CAPNS1. Heterodimeric CAPNS1/CAPN1 and CAPNS1/ 
CAPN2 have been named \i- and m-calpain, respectively, 
owing to the micromolar versus millimolar levels of Ca 2+ 
required for their activation [1]. In vivo, calpain activity 
depends on the presence of three key catalytic amino acid 
residues (Cys, His and Asn) in the CysPc domain [14,15], 
although enzymes with substitutions in these residues do 
not always display loss of function [16]. Comparative ana- 
lyses of representative animal genomes have revealed a 
relatively recent expansion of the calpain family and func- 
tional divergence among different paralogs [13,14,17,18]. 
Due to the modular nature of calpains, with many group- 
specific domains, phylogenetic tree construction based on 
the CysPc domain has proven to be the most efficient ap- 
proach to understand the evolutionary divergence of clas- 
sical and non-classical calpains [18,19]. For instance, the 
acquisition of the C-terminal PEF domain was shown, by 
this approach, to be a relatively recent event in the evolu- 
tionary history of classical calpains [20,21]. Using the evo- 
lutionary conservation of distinct calpain modules in 
diverse groups of eukaryotes, non-classical calpains were 
proposed to consist of several subfamilies, including the 
PalB subfamily found in humans, yeasts, fungi, insects and 
nematodes, the SOL subfamily for animals as well as the 
Dekl subfamily that is represented in land plants [3]. The 
phylogenetic relationship among the members of non- 
classical calpains and the sequential evolution of calpain 
modules, however, remains unresolved. 

Until now, studies on the diversity and evolutionary 
history of the calpain gene family have been focused on 
multicellular eukaryotes, including animals, land plants 
and fungi. Although calpain-like genes were also 
reported in several protozoan genomes such as those of 
the apicomplexan and kinetoplastid parasites [22,23], 
the great differences in gene number among those 
lineages (i.e. from one calpain-like gene in Plasmodium 
falciparum to 14 genes in Trypanosoma brucei) indicate 
a large variation of calpain diversity in single-celled 
eukaryotes. In fact, many unknown calpain genes may 
exist among hitherto unexplored unicellular organisms, 
some of which diverged early after eukaryotes arose 
and that display ancient morphological features and 
cellular structures [24,25]. Currently, almost all unicellu- 
lar eukaryotes, animals and land plants have been 
assembled into six major groups (i.e. supergroups) on 
the basis of multi-gene phylogenies and deductions 
from cellular structures and chemistry [26-28]. These 



supergroups comprise Opisthokonta (i.e. animals, fungi 
and Choanozoa), Amoebozoa (e.g. pathogenic amoeboid 
Entamoeba and slime molds Dictyostelium), Excavata 
(e.g. amoeboflagellate Naegleria, parabasalid parasite 
Trichomonas, kinetoplastida parasites Trypanosoma and 
Leishmania), Plantae (e.g. green algae, red algae and 
land plants), SAR (i.e. Stramenopila (e.g. brown algae, 
diatoms and oomycota), Alveolata (e.g. apicomplexa, cili- 
ates and dinoflagellates), and Rhizaria (e.g. cercozoa, for- 
aminifera and radiolaria)) and a loose assemblage of 
Hacrobia (e.g. Haptophyta, Cryptophyta and Telonemia). 
Despite several attempts, the evolutionary relationships 
between these supergroups have not been completely 
resolved, but recent reports suggest that Plantae, SAR 
and Hacrobia are constituting a mega group [29], while 
Opisthokonta and Amoebozoa seem to form another 
mega group [30,31]. In addition to these established 
supergroups, there are a few orphan eukaryote lineages, 
such as Thecamonas and Collodictyon, that may occupy 
distinct positions in the tree of life [32,33]. They 
diverged very early in the history of eukaryotes and are 
crucial for understanding the evolution of eukaryotes. 
As a consequence, a comprehensive investigation of 
calpains distributed in these supergroups and deep 
diverged lineages would represent a key step towards a 
broader classificaton system as well as revealing the evo- 
lutionary events that contributed to the variation in this 
gene family. Importantly, reconstructing calpain phylo- 
genetic profiles among these unicellular lineages may 
help illuminate the origin and evolution of proteolytic 
systems in eukaryotes and establish a practical frame- 
work on which experimental evidence can be compared 
between species. 

In this study, we have searched for calpain genes in 
genomes representing a broad taxonomic sampling from 
all eukaryote supergroups with emphasis on unicellular 
organisms. We present a massively expanded calpain 
gene family, in which a large number of the new calpain 
genes are composed of many other domains than previ- 
ously reported. Evolutionary inferences suggest that 
most of the calpain variants arose from the combina- 
tions of ancient domains through domain shuffling 
mechanisms. Domains derived more recently have con- 
tributed to the innovation of the calpain family by mul- 
tiple independent insertion events. The majority of the 
new calpain variants are considered as non-classical 
types, implying that the classical calpains typical for ani- 
mal and human genomes comprise only a small subset 
of the total gene family. The vast diversity of calpains 
described here provides a new framework for addressing 
the function of calpain genes in unicellular eukaryotes, 
and for elucidating the various levels by which these im- 
portant proteins are regulated to prevent diseases or de- 
velopmental defects in higher eukaryotes. 
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Results and discussion 

A large diversity of domains and genes revealed in the 
calpain family 

Our survey of calpain diversity identified a total of 41 
different domain arrangements, of which 28 have not 
been previously reported (Figure 1). Combinations be- 
tween CysPc and 24 other domains such as hATC (hAT 
family of dimerization domains), LIM (Zinc-binding 



domains present in Lin- 11, Isl-1 and Mec-3), TPR (Tet- 
ratricopeptide repeats), WW (a domain with two highly 
conserved tryptophans) and Zf_GRF (a GRF zinc finger) 
reveal this huge variation in calpain members (see 
Table 1). By searching protein family databases, we 
found homologs of 16 of these domains in either eubac- 
teria or archeabacteria (Table 1), implying the majority 
of domains are ancient and have participated in the 
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Figure 1 Taxonomic distribution of calpain variants (left side) revealed by comparison of genomes from 44 eukaryotic species (top 
section). Black circles indicate calpains with all three active site residues (Cys, His, Asn) intact in the CysPc domain. Grey circles indicate that the three 
CysPc active site residues are partially or completely substituted or missing. Half black and half grey circles indicate that some of detected calpains had all 
three residues, whereas others lacked at least one of the active residues. Black dots show that the calpain domain combination is absent. * indicates a 
novel domain combination identified in this work. All calpains are listed with accession numbers in Additional file 1: Table S1. 
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Table 1 The distribution of domains identified in calpain genes in the three domains of life 


Abbreviation 


Description 


Eukaryotes 


Eubacteria Archaebacteria 


Function 


Reference 


ADK 


Adenylate kinase 


-X 




Cellular energy homeostasis 


Berry MB et al., 
1998 


C2 


Protein kinase C 
conserved region 2 
(CalB) 


-X 


* * 


Signal transduction or membrane 
trafficking 


Ponting CP 
et al., 1996 


C2L 


Calpain domain III 


-X 




Regulation of calpain activity 


Sorimachi H 
et al., 201 1 


CysPc 


Calpain-like protease 
catalytic domain 


-X 




Apoptosis, membrane fusion, cell motility, 
and signal transduction 


Sorimachi H 
et al., 201 1 


DUF1935 


Domain of unknown 
function 1935 


■X 




Unknown 


Bateman A 
et al., 2010 


EF 


EF hands 






Ca 2+ -binding 


Nakayama S 
et al., 1992 


GuKc 


Guanylate kinase 
homologs 




* * 


Active Guanylate kinase catalyze ATP- 
dependent phosphorylation of GMP to GDP 


Kuhlendahl S 
et al., 1998 


hATC 


hAT family dimerization 
domain 






Proposed to participate in regulation of 
transcription 


Rubin et al., 
2001 


LIM 


Zinc-binding domain 
present in Lin-1 1, lsl-1, 
Mec-3 


■X 


* * 


Zn 2+ -binding, mediate protein-protein 
interactions 


Perez-Alvarado 
GC et al., 1992 


Nter 


N-terminal anchor helix 






Participate in classical calpain conformation 
change 


Sorimachi H 
et al., 201 1 


MORN 


Membrane Occupation 
and Recognition Nexus 






Unknown 


Takeshima 
et al., 2000 


PEF 


Calpain domain IV 


■X 




Ca 2+ -binding 


Sorimachi H 
et al., 201 1 


PUB 


PUB domain 


■X 


* 


AAA ATPase binding domain 


Allen MD 
et al., 2006 


SOL 


Small optic lobes 


■X 




Unknown 


Kamei M et al., 
1998 


TM(L/S) 


(Long or Short) 
Transmembrane motifs 






Unknown 


Corall and 
Ersfeld K 2007 


TPR 


Tetratricopeptide repeats 




# * 


Mediate protein-protein interactions 


Palmer CP 
et al., 2004 


UBA 


UBA^S-N domain 




* * 


Ubiquitin binding 


Dieckmann T 
et al., 1998 


UEV 


UEV domain 


■X 




Functions in both HIV-1 budding and the 
vacuolar protein sorting (VPS) pathway in 
human 


Wagner KU 
et al., 2003 


Vps23 


Vps23 core domain 






Stable ESCRT-I complex 


Teo H et al., 
2006 


WW 


Domain with two highly 
conserved tryptophans 






Binds proline-rich polypeptides 


Bork et al., 
1994 


Zf_GRF 


GRF zinc finger 


■X 




Proposed in nucleic acid binding 


Krishna SS 
et al., zuu I 


Zf_CCHH 


Zinc-finger 
(CX5CX6HX5H) motif 


-X 




DNA strand break repair, DNA metabolism 


lies N et al., 

2007 


Zn 


Zinc finger domain in 
Ran-binding proteins 


■X 


* * 


RanGDP binding 


Krishna SS 
et al., 2001 


AAA 


AAA ATPase domain 






ATPases associated with diverse cellular 
activities 


Hanson PI 
et al. 2005 


MIT 


Microtubule interacting 
domain 






Intracellular trafficking 


Phillips SA 
et al., 2001 



Star * indicates the type of domain is present in Eukaryotes, Eubacteria or Archaebacteria. 
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formation of other genes than calpains before the origin 
of eukaryotes. Innovation in the calpain family has there- 
fore taken place by adding both ancient and novel 
domains to the N- or C- terminus of CysPc with variable 
types, numbers and orders (Figure 1 and Additional file 1: 
Table SI). 

Among the investigated species, Thecamonas tra- 
hens, a lineage that phylogenetically belongs to Apu- 
sozoa, shows the highest number of different calpain 
paralogs. It has 12 genes that encode nine calpain 
variants (Figure 1 and Additional file 1: Table SI). 
Three other species, the single-celled phytoplankton 
Emiliania huxleyi, (Hacrobia) the amoeboa-flagella 
Naegleria gruberi (Excavata)and the ciliate Parame- 
cium tetraurelia (SAR) also contain a large number of 
genes encoding seven, five and four calpain variants, 
respectively. In contrast, one single calpain was found 
in all land plants (named Dekl), Entamoeba histoly- 
tica, Spizellomyces punctatus, Plasmodium falciparum, 
Cryptosporidium parvum and Thalassiosira pseudonana. 
No CysPc domain was detected in Giardia intestinalis, 
Dictyostelium discoideum, Chlorella NC64A or red algae 
(Cyanidioschyzon merolae), even if we loosened the 
threshold (e-value < 0.01) in the BLAST searches. Thus, 
we found a large variation in calpain gene numbers in the 
44 representative taxa. Even closed related species, such as 
species in genera Trichomonas and Giardia, have very dif- 
ferent calpain gene content. 

Four ancient eukaryotic calpain domain architectures 

Of all 41 calpain types, we identified that 13 variants are 
present in more than one supergroup. For instance, 
types 1 (CysPc), 3 (CysPc-C2L) and 4 (CysPc-C2L-C2L) 
show a scattered taxonomic distribution (Figure 1). In 
order to determine whether these calpain genes evolved 
only once or on multiple occasions, we reconstructed 
their evolutionary relationships based on the alignment 
of CysPc. Based on the distribution of domain combina- 
tions across the eukaryote tree and the similarity of 
domain components in various genes, we propose that 
four calpain architectures CysPc, CysPc-C2L, MIT- 
CysPc-C2L and TML-CysPc-C2L originated early in the 
evolutionary history of eukaryotes Figure 2, Figure 3 and 
Additional file 2: Figure SI. 

First, the type of calpain containing only CysPc is 
likely the most ancient form, both because it has the 
broadest distribution across eukaryote supergroups, and 
is the only form of calpain present in eubacteria [18]. As 
seen from Figure 1, this variant is found in Opistho- 
konta, Excavata, SAR, Plantae, Hacrobia and Thecamo- 
nas. The clustering of CysPc calpain into seven different 
clades (Figure 2 and Additional file 2: Figure SI), sug- 
gests that several ancient paralogs were established by 
multiple independent duplications and subsequently 



diverged into distinct forms. Interestingly, within some 
of the CysPc clades, a few sequences are composed of 
multiple domains (e.g. types 36 and 27), implying that 
the different CysPc paralogs have recruited other 
domains along the evolution of the eukaryotes at several 
independent occasions. In order to investigate whether 
the eukaryotic CysPc paralogs derived from bacteria 
once or many times, we added several prokaryote CysPc 
domains to the alignment and repeated the phylogenetic 
reconstruction (see Additional file 3: Figure S2). In the 
resulting tree, the bacterial paralogs are weakly recov- 
ered as a monophyletic group (BP <50%, PP_LG = 0.68 
and PP_CAT = 0.53 in Additional file 3: Figure S2; BP is 
bootstrap support; PP_LG/PP_CAT is the bayesian pos- 
terior probability inferred under LG/CAT model) with 
affinity to one of the eukaryotic CysPc clades, indicating 
that all eukaryote CysPc variants may have evolved from 
a single bacterial paralog. 

The calpain type 3 (CysPc-C2L) probably arose soon 
after the early duplication of domain CysPc based on its 
presence in nearly all supergroups (e.g. Opisthokonta, 
Excavata, Hacrobia and SAR). In addition, it was found 
in Thecamonas. This gene may have been lost early in 
Amoebozoa and Plantae (Figure 1 and Figure 2). Several 
CysPc-C2L calpains show variable numbers of C2L (i.e. 
types 4-6), suggesting multiple tandem duplications 
of C2L. Since Figure 2 and Additional file 2: Figure SI 
both show CysPc-C2L calpains as a paraphyletic clade, 
independent C2L domains may have been added to the 
different CysPc paralogs early in eukaryote evolution. 
Furthermore, the phylogeny indicates that CysPc-C2L 
calpains have subsequently recruited other types of 
domains on several independent occasions giving rise 
to new calpain genes (Figure 2 and Additional file 2: 
Figure SI). 

The third ancestral calpain subfamily is composed of 
the MIT-CysPc-C2L variants (i.e. type 13-18; Figure 1). 
The members in this subfamily are present in most 
eukaryote supergroups, and they are strongly recovered 
as a monophyletic clade in the phylogeny (93% BP, 0.99 
PP_LG and 1.00 PP_CAT in Figure 2 and Additional file 
2: Figure SI). The MIT- CysPc- C2L calpains may have 
been formed by a integration of a microtubule interact- 
ing and transport (MIT) module at the N-terminus 
of the CysPc-C2L. Our survey revealed a broad distribu- 
tion of this architecture in Opisthokonta and Apusozoa 
(Figure 1). In addition, we identified several calpains 
with highly divergent MIT domains in the genomes of 
Ectocarpus, Thalassiosira, Fragilariopsis and Leishmania 
(Additional file 2: Figure SI). This broad distribution of 
the MIT domain among eukaryotes indicates that it 
might have been present since the very early stages of 
calpain evolution. In unicellular Opisthokonta and 
deeply diverging animals, a few taxa (i.e. Thecamonas, 
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Figure 2 Eukaryotic calpain phylogeny reconstructed by Bayesian inference of the CysPc domain sequence. Several lineages that share 
similar domain combinations are collapsed into major clades for details of species names and protein domains, see Figure 1 and Additional file 2: 
Figure SI. Typical domain organization is schematically displayed in each clade. The names of domains are shown on the bottom with different 
symbols. Only the genus names of sampled taxa are listed. The clades representing four proposed ancestral domain architectures are color-coded 
(i.e. CysPc in dark brown, CysPc-C2L in blue, MIT-CysPc-C2L in light green and TML-CysPc-C2L in dark green). For each node, statistical support 
values are marked (numbers from left to right: Bayesian posterior probabilities (PP) inferred under LG /CAT models and maximum-likelihood 
bootstraps (% BP) inferred using PROTGAMMALG model). Dashes '-' indicate support values < 50% BP or 0.5 PP. 
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Figure 3 Proposed origin of calpain domain architectures mapped on the global eukaryote phylogeny [26-28,33]. Black bars on the 
branches indicate the hypothetical origin of specified domain combinations. Calpain variants found in one taxon or among closely related 
species are schematically displayed within the supergroup rectangles, which likely constitute lineage-specific domain combinations. 



Capsaspora and Nematostella) have one duplicated C2L 
at the C-terminus, forming MIT-CysPc-C2L-C2L. In 
contrast, the presence of one duplicated MIT domain at 
the N-terminus of CysPc is observed in vertebrates (BP = 
86%, PP_LG = 0.99 and PP_CAT = 0.81; Additional file 4: 
Figure S3), implying that the calpain architecture in the 
ancestral vertebrate may have been MIT-MIT-CysPc- 
C2L/C2L. 

The fourth most ancient architecture is represented in 
calpain types 19-22, consisting of a large transmem- 
brane domain TML, with more than 15 transmembrane 
segments linked to the N-terminus of CysPc-C2L. These 
are found in the Excavata, Plantae and SAR supergroups 
as well as in Thecamonas, and are inferred to be mono- 
phyletic in the calpain phylogeny (BP < 50%, PP_LG = 
0.99 and PP_CAT = 0.66; Figure 2 and Additional file 2: 
Figure SI). Recent multi-gene phylogenomic trees ro- 
bustly support Thecamonas branching as a sister to 
Opisthokonta [31]. Thus, the broad presence of this par- 
ticular calpain type implies that the TML-containing cal- 
pain gene most likely originated very early but was 
subsequently lost in the common ancestors of Amoebozoa 
and Opisthokonta (Figure 1 and Figure 2). Interestingly, 
gene expansions within this subfamily are observed in the 
Excavata species Trichomonas, Seven paralogs of TML 



calpains in Trichomonas are robustly recovered as a 
monophyletic group. They all show the absence of the 
C2L domain, indicating that it may have been lost in a sin- 
gle event before the multiple duplications of the gene 
(Additional file 2: Figure SI). In contrast to other lineages, 
only one type of calpain gene with a TML domain (TML- 
CysPc-C2L), named Dekl, is found in land plants ranging 
from Physcomitrella to Oryza. We did not identify any 
Dekl -like calpains carrying a TML module in the genomes 
of Chlamydomonas, Volvox, Micromonas or Ostreococcus. 
Altogether, the wide distribution of the TML domain 
among eukaryotic calpains suggests that this domain com- 
bination was likely formed before the split of land plants 
and green algae, then secondarily lost in the early evolu- 
tion of the Chlorophyta green algae. 

Implications of an alternative eukaryote root 

The evolutionary events that contributed to the assem- 
bly of these four ancient calpain subfamilies are sug- 
gested to have occured before the bipartition of 
'Opisthokonta- Amoebozoa and 'Excavata- Plantea-SAR- 
Hacrobia (Figure 3) in accordance with the previously 
proposed hypothesis for the eukaryotic root [28,34]. 
However, we still deduced these four calpain types as an- 
cient forms if we change the root of eukaryotes to the 
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Excavata as recently proposed [35] (See Additional file 5: 
Figure S4), because of the broad taxonomic distribution 
of the four subfamilies and their presence in key Exca- 
vata species. 

Lineage specific gene innovations 

Several calpain domain architectures appear to be spe- 
cific for one or more eukaryote supergroups, and are 
therefore most likely generated before the massive speci- 
ation in these lineages along the eukaryote tree. One of 
these, Zn-CysPc, seems to be possessed by Opisthokonta 
(Additional file 3: Figure S2 and Figure 3; BP < 50%, 
PP_LG = 0.55 and PP_CAT = 0.65) while another domain 
combination, Zf_GRF-CysPc, is shared by SAR, Hacrobia 
and Plantae (BP = 79%, PP_LG = 0.99 and PP_CAT = 0.97; 
Additional file 3: Figure S2 and Figure 3). Both the Zn 
and Zf_GRF domains have multiple finger-like protru- 
sions that make tandem contacts with their target mole- 
cules, but their distinct motifs and distributions among 
eukaryotes suggest independent origins (Figure 3). 

Novel domain architectures are also found to be spe- 
cific for certain groups of eukaryotes (Figure 3). For 
instance, an insertion of one SOL module to the 
C-terminus of Zn-CysPc was observed only in Caenor- 
habditis, Drosophilia, Homo and Mus, indicating that 
Zn-CysPc-SOL may have formed after the origin of 
Metazoa. Moreover, our survey shows no indications of 
the classical calpain architecture (Nter-CysPc-C2L-PEF) 
outside the Metazoa and hence strengthens the notion 
that these genes are animal-specific [3] (Additional file 
6: Figure S5). Both Nter and PEF were likely added to 
the N- and C-termini, respectively, of CysPc-C2L var- 
iants in the last common ancestor of Metazoa [2]. In 
some calssical calpains, PEF domains contribute to pro- 
tein dimerization (either homodimerization or heterodi- 
merization). However, based on recent data, it seems 
that the occurrence of PEF and Nter domains is not dir- 
ectly associated with dimerization process in all classical 
calpains. We also found that two calpain variants with 
an EF-hand module are present in Paramecium (type 32: 
EF-CysPc-C2L) and Tetrahymena (type 33: EF-CysPc), 
respectively (Figure 1). However, their motifs are not 
identical to PEF, indicating independent insertion of the 
EF-hand domains. The EF-hand modules found in ciliate 
calpains may be involved in binding calcium ions [36], 
but it is uncertain what concentration of calcium ions is 
required for regulating these calpains. 

A few calpain domain combinations seem to have been 
derived very recently in a smaller group of organisms. 
For instance, only the closely related species Trypano- 
soma and Leishmania share the domain architecture 
DUF1935-CysPc; hence this combination is exclusive 
for Trypanosomatidae (BP = 97%; PP_LG = 1.00 and 
PP_CAT = 0.74; Figure 2 and Figure 3). Calpain type 36, 



composed of CysPc-GuKc-ADK-AAA, is only identified 
in Chlamydomonas and Volvox. It may have been cre- 
ated by the insertion of domains GuKc, ADK and AAA 
into the C-terminus of the CysPc paralog, therefore 
representing a gene unique for Chlorophyta green algae 
(Figure 3). The gain of PUB and UBA modules at the N- 
termini of Emilinia CysPc paralogs shows that the two 
ombinations are probably unique to this particular group 
of Haptophyta. The close relationship of PUB-CysPc and 
UBA-CysPc in the phylogenetic tree (Figure 3) implies 
they probably share a common origin (BP = 100%, 
PP_LG = 1.00 and PP_CAT = 0.99; Additional file 2: 
Figure SI). Two other calpain variants, one with simple 
transmembrane motifs (TMS) and the other with 
MORN, are both found in Tetrahymena and Parame- 
cium. They seem to be shared by ciliates (BP = 88%, 
PP_LG = 1.00 and PP_CAT = 0.88 for CysPc-TMS; BP = 
94%, PP_LG = 1.00 and PP_CAT = 1.00 for MORN- 
CysPc; Additional file 2: Figure SI). As we show above, 
many other domains were added to the N- or C- termini 
of CysPc paralogs by domain-wise evolutionary events 
and formed lineage specific arrangements [37-39]. 

Two patterns of domain shuffling in the calpain gene 
family 

Altogether, the data reveal two different patterns in the 
evolution of the calpain gene family. First, the majority of 
domains in calpains are ancient and have existed as part of 
other genes and not in combination with CysPc since be- 
fore the origin of eukaryotes. The early evolution of the 
calpain gene family therefore most likely occurred by do- 
main shuffling of ancient domains. During the evolution- 
ary history of eukaryotes, domains have been added 
successively to create increasingly complex genes. In 
addition, there are several examples of secondary losses or 
modification of domains. For instance, in the MIT-calpain 
clade in Figure 2 and Additional file 2: Figure SI, several 
sequences either lack or have highly divergent MIT 
domains, indicating that the MIT domains have been sec- 
ondarily lost or modified at several occasions. Second, in 
contrast to this dominating pattern of evolution, we also 
find nine domains in the calpains which are not present 
among prokaryotes (see Table 1). Such domains have con- 
tributed to the generation of genes unique for specific 
supergroups of eukaryotes and hence constitute lineage- 
specific paralogs. Genes specific to certain eukaryote 
groups therefore seem to have evolved either by combin- 
ing only ancient domains, such as UBA, GuKc and ADK, 
or by combining a mix of ancient and novel domains, such 
as the Trypanosomatidae-specific DUF1935. 

Uncover new functions in non-classical calpains 

Up to now, insight into the regulatory mechansims and 
physiological functions of calpains has mainly been 
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based on studies of the classical calpains in mammals. 
Obviously, the discovery of a large variety of non- 
classical calpains creates expectations of many new func- 
tions and regulatory mechansims for calpains yet to be 
uncovered. One important source for new functions lies 
in the large number of non-classical calpains where 
one or more of the three amino acid residues essential 
for enzyme catalytic function have been replaced (grey 
and half grey circles in Figure 1 and more details in 
Additional file 1: Table SI). In particular, the occurrence 
of such substitutions is prominent for Emilinia, Ectocar- 
pus as well as Trypanosomatidae [22], where more than 
50% of the calpains variants show partial or complete 
loss of catalytic site residues. The findings support earl- 
ier suggestions that these variants may have divergent 
functions that do not rely on proteolytic processing [2]. 
This is demonstrated by the recent finding that links 
CAPN6 to a non-proteolytic function in eutherians, 
where the active site Cys is replaced with Lys. The non- 
classical CAPN6 functions as a microtubule-stabilizing 
protein [40]. Moreover, CAPN5, CAPN6, CAPN7 and 
CAPN10 have been suggested to share a similar function 
in the regulation of microtubule stability due to their 
comparable architectures with tandem C2L/C2 domains 
at the C-terminus of CysPc [41]. Thus, they are classified 
together in the PalB subfamily. Yet, this grouping is not 
supported by the present analysis. As discussed above, 
CAPN7 (MIT-MIT-CysPc-C2L-C2L) shows an ancestral 
origin (Figure 2 and Additional file 2: Figure SI). In 
contrast, CAPN6 and CAPN10 seem to have arisen 
more recently. CAPN6 is placed as sister to CAPN5, 
supported by 64% BP, 0.99 PP_LG and 0.97 PP_CAT 
(Additional file 6: Figure S5). Both of them are grouped 
with CAPN10 close to the clade of the classic calpains 
(CAPN13 and CAPN14; Additional file 6: Figure S5). 
Therefore, despite the common domain features, the 
clustering of CAPN7, CAPN6 and CAPN10 in different 
groups, indicate that they may have acquired different 
functions. 

The non-classical calpain Dekl is localized to the 
plasma membrane where it is proposed to be activated 
by a transmembrane anchor [12]. It has been suggested 
that the Dekl homolog of the protist Tetrahymena may 
have been acquired by lateral gene transfer from a green 
alga- type endosymbiont of ciliates [2]. With the addition 
of more TML-calpains to the phylogeny, the Dekl -like 
calpain in Tetrahymena now clusters weakly as a sister 
to those in Thecamonas (Additional file 2: Figure SI). 
This result does not support the lateral gene transfer hy- 
pothesis to explain the expansion of Dekl -like genes in 
taxa outside land plants. On the contrary, our phylogeny 
indicates an early origin of Dekl -like genes and subsequent 
divergent evolution of the amino acid sequences. Interest- 
ingly, one mammalian non-classical calpain (CAPN15) 



shows significant similarity to the CysPc domain of Dekl- 
like calpains (almost 40% amino acid sequence identity) 
[18]. However, the phylogenetic tree does not indicate that 
they share the same origin. In addition, we observed com- 
plicated multiple domain architectures of TML-calpains in 
the taxa of Stramenopila. They share a common arrange- 
ment C2-TML-CysPc. For Ectocarpus, the TPR domain 
was inserted to the N- terminus of C2-TML- CysPc. By con- 
trast, in Albugo and Phytophthora, the WW domain was 
anchored to the C-terminus of the sequence. Overall, we 
conclude from these studies that a high degree of sequence 
divergence and a variety of multiple domain architectures 
in TML-calpains provide a promising system to elucidate 
the functional significance of their membrane association. 

The simplest calpain variant consisting only of the sin- 
gle cysteine protease domain CysPc is rare in multicellu- 
lar organisms. Rather, it is prevalent in primitive protists 
and algae. Likewise, we observed that a few non-classical 
calpains do not contain the C2L domain, while all of 
non-classical calpains lack the PEF domain. The regula- 
tion of these calpain forms is currently unknown. How- 
ever, based on the observation that the two Ca 2+ binding 
sites of CysPc are implicated in the regulation of clas- 
sical calpains, one possibility is that Ca 2+ is the main 
regulator of these enzymes as well [42,43]. Classical cal- 
pains, on the other hand, have evolved multiple levels of 
control over their proteolytic activities imposed by inter- 
actions between CysPc and C2L, PEF, the small regula- 
tory subunit as well as the calpastatin inhibitor [44]. 
Based on the elucidation of the evolutionary history of 
calpains presented here, it is now possible to study each 
control step separately in the different calpain variants 
shown in Figure 1. Ultimately, this insight may contrib- 
ute to novel strategies for controlling calpains in human 
pathologies and further the progress of research into cal- 
pain function in general. 

Conclusion 

Calpains in vertebrates and land plants are known to be 
crucial for a multitude of physiological and intracellular 
processes. Here, we report a massive expansion of the 
calpain gene family in unicellular eukaryotes, many of 
which arose by combining CysPc with protein domains 
previously unrecognized in this family. Phylogenetic 
inferences support the hypothesis that four calpain gene 
variants may have been formed in the early evolution of 
eukaryotes by assembly of ancient domains that were 
already present among prokaryotes. The lineage-specific 
calpain genes, however, were formed through shuffling 
of both ancient and novel eukaryote-specific domains. 
Overall, comparative genomic analyses of this family es- 
tablish a framework for understanding the evolutionary 
mechanisms involved in the origin and expansion of 
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eukaryote calpain genes, and it provides a basis for in- 
vestigating cellular functions of calpain genes. 

Methods 

Taxonomic sampling 

We sampled 34 unicellular eukaryotic organisms for 
surveying and characterizing calpain diversity. These 
taxa represented a wide selection from the proposed 
supergroups including the Choanozoa (i.e. Capsospora 
owczarzaki and Salpingorca rostta), the basal fungi (i.e. 
Spizllomyces punctatus, Batrachochytrium dendrobatidis 
and Mortierella verticillata), the pathological parasites (i.e. 
kinetoplastid, apicomplexan, Trichomonas vaginalis and 
Entamoeba histolytic), green algae (i.e. Chlamydomonas 
reinhardtii and Volvox carteri), brown algae (i.e. Ectocar- 
pus siliculosus, Phytophthora infestans and Thalassiosira 
pseudonana), phagotrophic protists (i.e. Thecomonas tra- 
hensy Tetrahymena thermophila and Paramecium tetraute- 
lia) and marine phytoplankton (i.e. Bigelowiella nartens, 
Guillardia theta and Emiliania huxleyi). In addition, three 
early diverged multicellular species (i.e. Amphimedon 
queenslandica, Trichoplax adhaerens and Nematostella 
vectonsis) in Metazoa together with seven representative 
land animals and plants were involved in taxonomic sam- 
pling as well. For a full overview of the species in our ana- 
lyses and how they are related to the corresponding 
supergroup, see Additional file 1: Table SI. 

Comparative genome analyses 

A primary BlastP or tBlastN [45] search was performed 
using the CysPc domain of CAPN1 and CAPN2 as query 
sequences against protein and genome databases of the 44 
aforementioned eukaryotic organisms (Genbank eukaryote 
genomes, non-redundant databases, the Institute for Gen- 
omic Research, Joint Genome Institute Resource, the 
Broad Institute of Harvard and MIT) to identify genes 
containing the CysPc domain (e-value < e" 10 ). We used 
a relatively strict criterion to collect calpain-like genes 
with a high-quality sequence. Specifically, we included 
sequences that showed more than 40% overlap with the 
CysPc query. Here, one gene encoding a calpain-like cyst- 
eine protease' was found in Dictyostelium discoideum AX4 
[46], but this sequence was excluded from our dataset due 
to lacking significant hits in the CysPc domain region. 
Each calpain-like sequence was then searched against the 
protein conserved domain database (CDD), SMART data- 
base and the Pfam database to annotate domain modules 
[47-49]. We only selected domain modules with a signifi- 
cant annotation in Pfam, SMART or CCD. But for MIT 
module detection, we loosened the threshold and included 
a few sequences with derived domains in order to better 
understand the origin of MIT calpains and the pattern of 
MIT loss and modifications. Since boundaries of a domain 
region annotated by the three database searches were not 



identical, we used the consensus of all significant domains 
predictions. In cases where domains of different character- 
istics overlapped, we regarded the annotation too uncer- 
tain to be included in the presentation. For the CysPc 
domain, we illustrated the domain based on the Pfam an- 
notation only since the annotations from different 
databases showed a sequence overlap > 90%. For other 
domains, we used the consensus from the annotations of 
different databases to define the domain. When different 
annotation approaches identified variable number of do- 
main repeats, we only show one copy to represent that do- 
main. For the identification of the three catalytic sites in 
CysPc domain, we combined the prediction given by the 
Pfam database and the pairwise comparison of each pro- 
tein studied here vs. the classical CysPc domain. Trans- 
membrane motifs were predicted by the TMHMM Server 
v. 2.0 [50]. The detailed information (i.e. accession number, 
the type of domain combination and number of replace- 
ments in the three catalytic sites) of all calpain sequences 
presented here are listed in Additional file 1: Table SI. 

Construction of the alignment 

The CysPc sequences were then aligned by MAFFT using 
the L-INS-i algorithm [51], following by manual editing in 
MacClade 4.0 [52]. Only positions that were unambigu- 
ously aligned were included in the further analyses. Align- 
ments that sampled 259 sequences were initially analyzed, 
and then reduced by removing 12 sequences (for details, 
see Additional file 1: Table SI) because of their highly 
divergent sequences prone to generating phylogenetic 
artefacts. The final alignment consisted of 247 sequences 
and 202 sites available at http://www.mn.uio.no/bio/eng- 
lish/people/aca/kamran/data/Calpain_final_247_accession. 
dat (the original alignment before editing can be down- 
loaded from http://www.mn.uio.no/bio/english/people/ 
aca/kam-ran/data/Calpain_final_247_accession_original.dat). 
ProtTest 3.0 was used for amino acid substitution model 
selection using the Akaike Information Criterion (AIC) to 
choose the best-fitting tested model (LG + GAMMA) for 
phylogenetic analyses [53]. 

Phylogenetic analyses 

Reconstruction of maximum likelihood (ML) phylogenies 
from the calpain sequence alignment was performed using 
RAxML v7.2.6 [54]. For multiple CysPc domains found in 
calpain sequences (e.g. types 2, 9, 10 and 12 in Figure 1), all 
of them were included in the phylogenetic analyses. The 
best topology was determined after 100 heuristic searches 
starting from 100 different random trees under the PROT- 
GAMMALG model. Statistical support was evaluated with 
500 bootstrap pseudo-replicates under the same model as 
in the initial tree search. Bayesian phylogenies were inferred 
by Phylobayes v3.2 under both the LG model and the CAT 
mixture model in combination with four gamma categories 
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for approximating the rate heterogeneity across sites [55]. 
Two independent Markov Chains Motor Carlo (MCMC) 
starting from two random trees were run for 50,000 cycles 
with one tree being sampled every cycle. Consensus top- 
ology and posterior probability values were calculated from 
saved trees after discarding 10,000 cycles as burn-in. Con- 
vergence between the two chains was ascertained by exam- 
ing the difference in frequency for all their bipartitions 
(maxdiff < 0.15 in all analyses). 

Additional files 



the phylogeny of MIT calpains can be downloaded from http://www.mn.uio. 
no/bio/english/people/aca/kamran/data/Calpain_MIT_accession.dat. 

Additional file 5: Figure S4. Proposed origin of calpain superfamily 
domain combinations shown on the global eukaryote phylogeny. The 
tree is rooted according to Cavalier-Smith T 2010. Biol Lett 6(3):342-345. 
Red bars on the branches indicate the hypothetical origin of specified 
domain combinations. Calpain variants only found in one taxon or 
among closely related species are marked in red within the supergroup 
rectangles, which likely constitute lineage-specific domain combinations. 

Additional file 6: Figure S5. Phylogeny of calpains in Metazoa. The 
topology is obtained from the consensus between two independent 
Bayesian inferences. The representative taxa sampled here include Homo 
species, Mus musculus, Gal I us gallus, Xenopus laevis, Danio rerio, 
Nematostella vectensis, Drosophilo melanogaster, Caenorhabditis elegans, 
Trichoplas adhaerens and Amphimedon queenslandica. Support values are 
marked at the nodes, and number from left to right represent Bayesian 
posterior probabilities (PP) inferred under CAT /LG models and the 
maximum-likelihood bootstraps (% BP) inferred using PROTGAMMALG 
model. All support values more than 80% BP and 0.8 PP or more than 
50% BP and 0.5 PP are marked by full or open circles. Dashes '-' indicate 
the support values < 50% BP or 0.5 PP. The branches and clades are 
assigned to the numbers that represent specified domain combinations 
listed in Figure 1. The alignment used to construct this phylogeny can be 
downloaded from http://www.mn.uio.no/bio/english/people/aca/kamran/ 
data/Calpain_Metazoa_accession.dat. 
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